Skip to content

feat(grpc): enforce a maximum number of reconnection attempts#902

Open
Molter73 wants to merge 1 commit into
mainfrom
mauro/feat/crash-on-reconnect-fail
Open

feat(grpc): enforce a maximum number of reconnection attempts#902
Molter73 wants to merge 1 commit into
mainfrom
mauro/feat/crash-on-reconnect-fail

Conversation

@Molter73

@Molter73 Molter73 commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Description

If fact fails to connect to the gRPC server in the specified number of attemtps it will crash, which can be used as a clear signal in k8s that something is not working as expected. If the number of retries is set to 0, there will be no limit to the amount of times fact attempts to reconnect.

In order to allow the reconnection failure to trigger an application wide crash, the main output task monitors the result of the grpc task's handle propagating the error further up. This method should allow for other output components to be added in the future and follow this same pattern without having to change anything outside the output module. The stdout component is not included in this logic because it has no condition that could merit an application wide crash.

Checklist

  • Patch has a change log entry OR does not need one.
  • Investigated and inspected CI test results
  • Updated documentation accordingly

Automated testing

  • Added unit tests
  • Added integration tests
  • Added regression tests

If any of these don't apply, please comment below.

Testing Performed

Pointed fact to connect to a non-existant gRPC server and got it to crash as expected.

fact output on crash
[INFO  2026-06-24T16:55:17Z] FactConfig {
    paths: Some(
        [
            "/etc/sensitive-files/**/*",
            "/etc/sensitive-files",
        ],
    ),
    grpc: GrpcConfig {
        url: Some(
            "http://localhost:9999",
        ),
        certs: None,
        backoff: BackoffConfig {
            initial: None,
            max: Some(
                1s,
            ),
            jitter: Some(
                false,
            ),
            multiplier: None,
            retries_max: Some(
                3,
            ),
        },
    },
    endpoint: EndpointConfig {
        address: None,
        expose_metrics: Some(
            true,
        ),
        health_check: None,
    },
    bpf: BpfConfig {
        ringbuf_size: None,
        inodes_max: None,
    },
    skip_pre_flight: None,
    json: None,
    hotreload: None,
    scan_interval: None,
    rate_limit: None,
}
[INFO  2026-06-24T16:55:17Z] fact version: 0.3.x-104-g669f5be002
[INFO  2026-06-24T16:55:17Z] OS: Fedora Linux 44 (Workstation Edition)
[INFO  2026-06-24T16:55:17Z] Kernel version: 7.0.12-201.fc44.x86_64
[INFO  2026-06-24T16:55:17Z] Architecture: x86_64
[INFO  2026-06-24T16:55:17Z] Hostname: mmoltras-thinkpadp1gen5.remote.csb
[INFO  2026-06-24T16:55:17Z] Starting BPF worker...
[WARN  2026-06-24T16:55:17Z] Invalid initial value: 1 >= 1
[INFO  2026-06-24T16:55:17Z] Attempting to connect to gRPC server...
[INFO  2026-06-24T16:55:17Z] Starting host scanner...
[WARN  2026-06-24T16:55:17Z] Using unencrypted gRPC channel
[INFO  2026-06-24T16:55:17Z] Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111), retrying in 1s
[INFO  2026-06-24T16:55:18Z] Attempting to connect to gRPC server...
[WARN  2026-06-24T16:55:18Z] Using unencrypted gRPC channel
[INFO  2026-06-24T16:55:18Z] Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111), retrying in 1s
[INFO  2026-06-24T16:55:19Z] Attempting to connect to gRPC server...
[WARN  2026-06-24T16:55:19Z] Using unencrypted gRPC channel
[INFO  2026-06-24T16:55:19Z] Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111), retrying in 1s
[INFO  2026-06-24T16:55:20Z] Attempting to connect to gRPC server...
[WARN  2026-06-24T16:55:20Z] Using unencrypted gRPC channel
[INFO  2026-06-24T16:55:20Z] Exiting...
[INFO  2026-06-24T16:55:20Z] Stopping config reloader...
[INFO  2026-06-24T16:55:20Z] Stopping endpoints...
Error: Output worker errored out: grpc worker errored out: gRPC error: Failed to connect to server: reconnection attempts exhausted

Summary by CodeRabbit

  • New Features

    • Added a configurable limit for gRPC reconnect attempts, with support for command-line, environment variable, and YAML configuration.
    • Improved shutdown behavior so background work stops more cleanly when the app is asked to exit or reload.
  • Bug Fixes

    • gRPC connection failures now stop after the configured retry limit instead of retrying forever.
    • Connection and worker errors are now surfaced more clearly during startup and runtime.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 5f508bbe-adc6-43ad-92f3-0830cca81676

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • ✅ Review completed - (🔄 Check again to review again)
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch mauro/feat/crash-on-reconnect-fail

Comment @coderabbitai help to get the list of available commands.

@codecov-commenter

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 69.86301% with 44 lines in your changes missing coverage. Please review.
✅ Project coverage is 34.35%. Comparing base (5dd2a85) to head (a6f8c0e).

Files with missing lines Patch % Lines
fact/src/output/mod.rs 0.00% 20 Missing ⚠️
fact/src/lib.rs 0.00% 14 Missing ⚠️
fact/src/output/grpc.rs 90.09% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #902      +/-   ##
==========================================
+ Coverage   32.72%   34.35%   +1.62%     
==========================================
  Files          21       21              
  Lines        2695     2771      +76     
  Branches     2695     2771      +76     
==========================================
+ Hits          882      952      +70     
- Misses       1810     1816       +6     
  Partials        3        3              

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

If fact fails to connect to the gRPC server in the specified number of
attemtps it will crash, which can be used as a clear signal in k8s that
something is not working as expected. If the number of retries is set to
0, there will be no limit to the amount of times fact attempts to
reconnect.

In order to allow the reconnection failure to trigger an application
wide crash, the main output task monitors the result of the grpc task's
handle propagating the error further up. This method should allow for
other output components to be added in the future and follow this same
pattern without having to change anything outside the output module. The
stdout component is not included in this logic because it has no
condition that could merit an application wide crash.

We also now propagate the error of worker tasks all the way up to the
termination of the application.
@Molter73 Molter73 force-pushed the mauro/feat/crash-on-reconnect-fail branch from a6f8c0e to 669f5be Compare June 24, 2026 16:54

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
fact/src/output/mod.rs (1)

47-74: 📐 Maintainability & Code Quality | 🔵 Trivial | 🏗️ Heavy lift

Add focused coverage for gRPC supervision failures.

This loop is the new contract that converts Ok(Err(_)) and join failures into output-level errors, but the PR coverage report shows this file has no patch coverage. A small async test or extracted helper around the match res path would protect the shutdown behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fact/src/output/mod.rs` around lines 47 - 74, Add focused test coverage for
the gRPC supervision path in the output loop so shutdown/error handling is
exercised. The `tokio::spawn` block in `output::mod` currently maps
`grpc_handle.borrow_mut()` results into output-level failures, so add a small
async test or extract a helper around the `match res` branch to verify both
`Ok(Err(_))` and join/task errors are handled as expected.
fact/src/config/tests.rs (1)

567-582: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add the negative retries parse-error case.

Once the parser rejects negative retry counts, cover retries: -1 here so YAML validation matches the new retry-budget contract.

Proposed test case
         (
             r#"
             grpc:
               backoff:
                 retries: true
             "#,
             "invalid grpc.backoff.retries: Boolean(true)",
         ),
+        (
+            r#"
+            grpc:
+              backoff:
+                retries: -1
+            "#,
+            "invalid grpc.backoff.retries: Integer(-1)",
+        ),

As per coding guidelines, fact/src/config/mod.rs: "Add unit tests in fact/src/config/tests.rs for configuration schema changes in fact/src/config/mod.rs."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fact/src/config/tests.rs` around lines 567 - 582, The config parsing tests in
tests.rs are missing coverage for the new negative retry validation in
grpc.backoff.retries. Add a unit test case alongside the existing retries
parse-error assertions in the config tests to verify that retries: -1 is
rejected with the appropriate invalid grpc.backoff.retries error, using the same
pattern as the current retries validation cases.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@fact/src/config/mod.rs`:
- Around line 416-420: Reject negative YAML retry counts in the retries parsing
branch of the config loader, since converting a signed value to u64 can wrap and
bypass the intended retry limit. Update the logic in the backoff/retries
handling inside mod.rs to accept 0 and positive values only, and explicitly bail
out when v.as_i64() returns a value below zero before assigning to
backoff.retries_max. Also add unit coverage in fact/src/config/tests.rs for the
retries configuration path to verify negative values fail and zero remains
valid.

In `@fact/src/output/grpc.rs`:
- Around line 217-220: The reconnect exhaustion path in gRPC output handling
drops the last connection failure, so update the backoff termination branch in
the connection retry loop to preserve the final error details when reconnection
attempts are exhausted. In the logic around the backoff handling in the gRPC
connection code, make the terminal bail/error include the original connection
error value from the retry loop so callers can see the actual
DNS/TLS/refused-connection cause instead of only “attempts exhausted”.

---

Nitpick comments:
In `@fact/src/config/tests.rs`:
- Around line 567-582: The config parsing tests in tests.rs are missing coverage
for the new negative retry validation in grpc.backoff.retries. Add a unit test
case alongside the existing retries parse-error assertions in the config tests
to verify that retries: -1 is rejected with the appropriate invalid
grpc.backoff.retries error, using the same pattern as the current retries
validation cases.

In `@fact/src/output/mod.rs`:
- Around line 47-74: Add focused test coverage for the gRPC supervision path in
the output loop so shutdown/error handling is exercised. The `tokio::spawn`
block in `output::mod` currently maps `grpc_handle.borrow_mut()` results into
output-level failures, so add a small async test or extract a helper around the
`match res` branch to verify both `Ok(Err(_))` and join/task errors are handled
as expected.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 81043491-08dc-4da7-a651-5e8569ee9b26

📥 Commits

Reviewing files that changed from the base of the PR and between 818c659 and 669f5be.

📒 Files selected for processing (5)
  • fact/src/config/mod.rs
  • fact/src/config/tests.rs
  • fact/src/lib.rs
  • fact/src/output/grpc.rs
  • fact/src/output/mod.rs

Comment thread fact/src/config/mod.rs
Comment on lines +416 to +420
"retries" => {
let Some(retries) = v.as_i64() else {
bail!("invalid grpc.backoff.retries: {v:?}");
};
backoff.retries_max = Some(retries as u64);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Reject negative YAML retry counts before casting.

Line 420 turns values like retries: -1 into a huge u64, effectively bypassing the configured retry budget instead of failing configuration parsing. Keep 0 valid, but reject values below zero.

Proposed fix
                 "retries" => {
-                    let Some(retries) = v.as_i64() else {
+                    let Some(retries) = v.as_i64().filter(|retries| *retries >= 0) else {
                         bail!("invalid grpc.backoff.retries: {v:?}");
                     };
                     backoff.retries_max = Some(retries as u64);
                 }

As per coding guidelines, fact/src/config/mod.rs: "Add unit tests in fact/src/config/tests.rs for configuration schema changes in fact/src/config/mod.rs."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"retries" => {
let Some(retries) = v.as_i64() else {
bail!("invalid grpc.backoff.retries: {v:?}");
};
backoff.retries_max = Some(retries as u64);
"retries" => {
let Some(retries) = v.as_i64().filter(|retries| *retries >= 0) else {
bail!("invalid grpc.backoff.retries: {v:?}");
};
backoff.retries_max = Some(retries as u64);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fact/src/config/mod.rs` around lines 416 - 420, Reject negative YAML retry
counts in the retries parsing branch of the config loader, since converting a
signed value to u64 can wrap and bypass the intended retry limit. Update the
logic in the backoff/retries handling inside mod.rs to accept 0 and positive
values only, and explicitly bail out when v.as_i64() returns a value below zero
before assigning to backoff.retries_max. Also add unit coverage in
fact/src/config/tests.rs for the retries configuration path to verify negative
values fail and zero remains valid.

Source: Coding guidelines

Comment thread fact/src/output/grpc.rs
Comment on lines +217 to +220
let Some(delay) = backoff.next() else {
bail!("Failed to connect to server: reconnection attempts exhausted");
};
info!("Failed to connect to server: {e:?}, retrying in {delay:?}");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Preserve the final connection error on exhaustion.

Line 218 returns only “attempts exhausted”, so the terminal error can hide whether the last failure was DNS, TLS, refused connection, etc.

Proposed fix
                     let Some(delay) = backoff.next() else {
-                        bail!("Failed to connect to server: reconnection attempts exhausted");
+                        bail!(
+                            "Failed to connect to server: reconnection attempts exhausted after last error: {e:?}"
+                        );
                     };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let Some(delay) = backoff.next() else {
bail!("Failed to connect to server: reconnection attempts exhausted");
};
info!("Failed to connect to server: {e:?}, retrying in {delay:?}");
let Some(delay) = backoff.next() else {
bail!(
"Failed to connect to server: reconnection attempts exhausted after last error: {e:?}"
);
};
info!("Failed to connect to server: {e:?}, retrying in {delay:?}");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fact/src/output/grpc.rs` around lines 217 - 220, The reconnect exhaustion
path in gRPC output handling drops the last connection failure, so update the
backoff termination branch in the connection retry loop to preserve the final
error details when reconnection attempts are exhausted. In the logic around the
backoff handling in the gRPC connection code, make the terminal bail/error
include the original connection error value from the retry loop so callers can
see the actual DNS/TLS/refused-connection cause instead of only “attempts
exhausted”.

@Molter73 Molter73 marked this pull request as ready for review June 24, 2026 17:20
@Molter73 Molter73 requested a review from a team as a code owner June 24, 2026 17:20
Comment thread fact/src/config/tests.rs
},
..Default::default()
},
),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some tests with negative retries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants